Results 1 - 20 of 14,251
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600663

ABSTRACT

Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most protein sequence design methods based on deep learning focus on network architecture optimization, while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in protein structure prediction, we explored whether the representation of a structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on a structural sequence profile using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. The profile, combined with structural pre-trained knowledge and geometric features, is then fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms the state-of-the-art methods, such as ProteinMPNN, Pifold and LM-Design, leading to 21.89%, 15.54% and 11.4% accuracy gains in sequence recovery rate on the CATH 4.2 benchmark, respectively. Encouraging results have also been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted by the PDBench tool suggests that SPDesign performs well in subdivided structures. More interestingly, we found that SPDesign can accurately reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, the structural modeling verification experiment indicates that the sequences designed by SPDesign can fold into the native structures more accurately.
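The ultrafast shape recognition (USR) step mentioned in this abstract can be illustrated with a small sketch. It is not SPDesign's implementation: it assumes one common USR variant (three moments of the distance distributions from four reference points, giving a 12-element vector), and the function names and toy coordinates are hypothetical.

```python
import math

def _moments(dists):
    """Mean, standard deviation, and signed cube root of the third central moment."""
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1 / 3), third)]

def usr_descriptor(coords):
    """12-element USR vector for a list of (x, y, z) points, e.g. C-alpha atoms."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(ctd, c) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # point closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # point farthest from the centroid
    d_fct = [math.dist(fct, c) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # point farthest from fct
    vec = []
    for ref in (ctd, cst, fct, ftf):
        vec += _moments([math.dist(ref, c) for c in coords])
    return vec

def usr_similarity(a, b):
    """Similarity in (0, 1]; 1.0 means identical descriptors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))

# Toy backbone trace (made-up coordinates, for illustration only).
coords = [(0, 0, 0), (1.5, 0, 0), (3, 1, 0), (4, 2.5, 1), (5, 3, 3)]
print(usr_descriptor(coords))
```

Because the descriptor depends only on internal distances, it is unchanged by translation and rotation, which is what makes it usable as a cheap pre-filter before a full structure alignment.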


Subjects
Neural Networks, Computer , Proteins , Sequence Alignment , Amino Acid Sequence , Proteins/chemistry , Sequence Analysis, Protein/methods
2.
Nat Commun ; 15(1): 2775, 2024 Mar 30.
Article in English | MEDLINE | ID: mdl-38555371

ABSTRACT

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains its similarity prediction model on a large number of real structural similarities. This enables PLMSearch to capture the remote homology information concealed behind the sequences. Extensive experimental results show that PLMSearch can search millions of query-target protein pairs in seconds like MMseqs2 while increasing the sensitivity by more than threefold, and is comparable to state-of-the-art structure search methods. In particular, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs with dissimilar sequences but similar structures. PLMSearch is freely available at https://dmiip.sjtu.edu.cn/PLMSearch .


Subjects
Biological Evolution , Proteins , Proteins/chemistry , Molecular Sequence Annotation , Algorithms , Sequence Analysis, Protein
3.
Methods Mol Biol ; 2758: 61-75, 2024.
Article in English | MEDLINE | ID: mdl-38549008

ABSTRACT

Natural peptides secreted under stress conditions by many organisms are bioactive molecules with a broad spectrum of activities. These molecules could become potential models for novel pharmaceuticals, to which bacteria, according to modern scientific concepts, do not have and cannot develop resistance. Taking this into consideration, it is necessary to clarify the amino acid sequences of such peptides. Here we describe our approach to de novo sequencing of amphibians' skin secretion peptides.


Subjects
Sequence Analysis, Protein , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Sequence Analysis, Protein/methods , Peptides/chemistry , Amino Acid Sequence
4.
Science ; 383(6689): eadg4320, 2024 Mar 22.
Article in English | MEDLINE | ID: mdl-38513038

ABSTRACT

Many clinically used drugs are derived from or inspired by bacterial natural products that often are produced through nonribosomal peptide synthetases (NRPSs), megasynthetases that activate and join individual amino acids in an assembly line fashion. In this work, we describe a detailed phylogenetic analysis of several bacterial NRPSs that led to the identification of yet undescribed recombination sites within the thiolation (T) domain that can be used for NRPS engineering. We then developed an evolution-inspired "eXchange Unit between T domains" (XUT) approach, which allows the assembly of NRPS fragments over a broad range of GC contents, protein similarities, and extender unit specificities, as demonstrated for the specific production of a proteasome inhibitor designed and assembled from five different NRPS fragments.


Subjects
Bacterial Proteins , Evolution, Molecular , Peptide Synthases , Protein Engineering , Peptide Synthases/chemistry , Peptide Synthases/classification , Peptide Synthases/genetics , Phylogeny , Amino Acid Sequence/genetics , Bacterial Proteins/chemistry , Bacterial Proteins/classification , Bacterial Proteins/genetics , Sequence Analysis, Protein
5.
J Pharm Biomed Anal ; 243: 116094, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38479303

ABSTRACT

BACKGROUND: Tandem mass spectrometry (MS/MS) can provide direct and accurate sequence characterization of synthetic peptide drugs and peptide drug products, including side-chain modifications in the peptide drugs. This article presents a step-by-step guide to developing a high-throughput method using high-resolution mass spectrometry (HRMS) for characterization of Calcitonin Salmon injection containing a high proportion of UV-active excipients. METHODS: The major challenge in developing the amino acid sequencing and peptide mapping method was the presence of phenol in the drug product. Phenol is a UV-active excipient and reacts with both dithiothreitol (DTT) and trypsin. Hence, Calcitonin Salmon was extracted from the Calcitonin Salmon injection using solid-phase extraction; after the extraction, the amino acid sequencing and peptide mapping study was performed. Upon incubation of Calcitonin Salmon with trypsin and DTT, digested fragments were generated, which were separated by mass-compatible reverse-phase chromatography, and the molecular mass of each fragment was determined using HRMS. RESULTS: A reverse-phase chromatographic method was developed using UHPLC-HRMS for the determination of direct mass, peptide mapping, and amino acid sequencing in the Calcitonin Salmon injection. The method was found to be specific: the fragments after trypsin digestion were well resolved from each other, and the molecular mass of each fragment was determined using HRMS. Sequencing was performed using automated annotation of b and y ions and identifications based on MS/MS spectra using BioPharma Finder and Proteome Discoverer software. CONCLUSION: Using this approach, 100% protein coverage was obtained, the protein was identified as Calcitonin Salmon, and the observed masses of the tryptic digest of the peptide were found to be similar to the theoretical masses. The method can be used for both UV- and MS-based peptide mapping, and the UV-based peptide mapping method can be used as an identification test for Calcitonin Salmon drug substance and drug product in quality control.


Subjects
Calcitonin , Peptides , Tandem Mass Spectrometry , Peptide Mapping , Chromatography, High Pressure Liquid/methods , Tandem Mass Spectrometry/methods , Amino Acid Sequence , Trypsin/metabolism , Sequence Analysis, Protein , Proteome , Phenols
6.
PLoS Comput Biol ; 20(2): e1011892, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38416757

ABSTRACT

In proteomics, a crucial aspect is to identify peptide sequences. De novo sequencing methods have been widely employed to identify peptide sequences, and numerous tools have been proposed over the past two decades. Recently, deep learning approaches have been introduced for de novo sequencing. Previous methods focused on encoding tandem mass spectra and predicting peptide sequences from the first amino acid onwards. However, when predicting peptides using tandem mass spectra, the peptide sequence can be predicted not only from the first amino acid but also from the last amino acid due to the coexistence of b-ion (or a- or c-ion) and y-ion (or x- or z-ion) fragments in the tandem mass spectra. Therefore, it is essential to predict peptide sequences bidirectionally. Our approach, called NovoB, utilizes a Transformer model to predict peptide sequences bidirectionally, starting with both the first and last amino acids. In comparison to Casanovo, our method achieved an improvement of the average peptide-level accuracy rate of approximately 9.8% across all species.
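The reason a peptide can be read from either end is that b-ions grow from the N-terminus and y-ions from the C-terminus, and each b/y pair sums to a constant. The sketch below illustrates that relation for singly charged fragments using standard monoisotopic residue masses; it is not NovoB's code, and the function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def b_ions(peptide):
    """Singly charged b-ion m/z values b_1..b_(n-1), read from the N-terminus."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += MONO[aa]
        masses.append(total + PROTON)
    return masses

def y_ions(peptide):
    """Singly charged y-ion m/z values y_1..y_(n-1), read from the C-terminus."""
    masses, total = [], 0.0
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        masses.append(total + WATER + PROTON)
    return masses

peptide = "PEPTIDE"
b, y = b_ions(peptide), y_ions(peptide)
n = len(peptide)
neutral_mass = sum(MONO[aa] for aa in peptide) + WATER
for i in range(1, n):
    # b_i and y_(n-i) come from the same cleavage site and sum to M + 2 protons.
    print(i, round(b[i - 1], 4), round(y[n - i - 1], 4),
          round(b[i - 1] + y[n - i - 1] - (neutral_mass + 2 * PROTON), 6))
```

A decoder working forward consumes the b-series in order, while a decoder working backward consumes the y-series, so the two directions see complementary evidence for the same cleavage sites.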


Subjects
Algorithms , Sequence Analysis, Protein , Sequence Analysis, Protein/methods , Peptides/chemistry , Amino Acid Sequence , Amino Acids
7.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38340092

ABSTRACT

De novo peptide sequencing is a promising approach for novel peptide discovery, which makes performance improvements in state-of-the-art models important. The quality of mass spectra often varies due to the unexpected absence of certain ions, presenting a significant challenge in de novo peptide sequencing. Here, we use a novel concept of complementary spectra to enhance ion information of the experimental spectrum and demonstrate it through conceptual and practical analyses. Afterward, we design suitable encoders to encode the experimental spectrum and the corresponding complementary spectrum and propose a de novo sequencing model $\pi$-HelixNovo based on the Transformer architecture. We first demonstrated that $\pi$-HelixNovo outperforms other state-of-the-art models using a series of comparative experiments. Then, we utilized $\pi$-HelixNovo to de novo sequence gut metaproteome peptides for the first time. The results show $\pi$-HelixNovo increases the identification coverage and accuracy of gut metaproteome and enhances the taxonomic resolution of gut metaproteome. We finally trained a powerful $\pi$-HelixNovo utilizing a larger training dataset, and as expected, $\pi$-HelixNovo achieves unprecedented performance, even for peptide-spectrum matches with never-before-seen peptide sequences. We also use the powerful $\pi$-HelixNovo to identify antibody peptides and multi-enzyme cleavage peptides, and $\pi$-HelixNovo is highly robust in these applications. Our results demonstrate the effectiveness of the complementary spectrum and take a significant step forward in de novo peptide sequencing.


Subjects
Sequence Analysis, Protein , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Sequence Analysis, Protein/methods , Peptides , Amino Acid Sequence , Antibodies , Algorithms
8.
Nat Commun ; 15(1): 151, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38167372

ABSTRACT

Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a de novo peptide sequencing method for tandem mass spectrometry. Spectralis leverages several innovations including a convolutional neural network layer connecting peaks in spectra spaced by amino acid masses, proposing fragment ion series classification as a pivotal task for de novo peptide sequencing, and a peptide-spectrum confidence score. On spectra for which database search provided a ground truth, Spectralis surpassed 40% sensitivity at 90% precision, nearly doubling state-of-the-art sensitivity. Application to unidentified spectra confirmed its superiority and showcased its applicability to variant calling. Altogether, these algorithmic innovations and the substantial sensitivity increase in the high-precision range constitute an important step toward broadly applicable peptide sequencing.
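The idea of connecting peaks "spaced by amino acid masses" can be shown without any neural network. The sketch below simply lists pairs of peaks whose m/z gap matches a residue mass within a tolerance; it is an illustration of the underlying observation, not Spectralis's convolutional layer, and the tolerance, function name, and toy peak list are assumptions.

```python
# Monoisotopic residue masses in Da; Leu and Ile are isobaric and share one entry.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "I/L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def residue_links(mzs, tol=0.02):
    """Return (low_mz, high_mz, residue) for peak pairs one residue mass apart.

    Assumes singly charged fragments; tol is an absolute tolerance in Da.
    """
    mzs = sorted(mzs)
    links = []
    for i, lo in enumerate(mzs):
        for hi in mzs[i + 1:]:
            gap = hi - lo
            for aa, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tol:
                    links.append((lo, hi, aa))
    return links

# Toy peak list: the b1, b2, b3 ions of a peptide starting with P-E-P.
for lo, hi, aa in residue_links([98.06004, 227.10263, 324.15539]):
    print(f"{lo:.4f} -> {hi:.4f}: {aa}")
```

Chains of such links belong to the same fragment ion series, which is the kind of structure a model can exploit when classifying peaks.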


Subjects
Deep Learning , Algorithms , Sequence Analysis, Protein/methods , Peptides/chemistry , Amino Acid Sequence
9.
Comput Biol Med ; 170: 107956, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38217977

ABSTRACT

The classification and prediction of T-cell receptors (TCRs) protein sequences are of significant interest in understanding the immune system and developing personalized immunotherapies. In this study, we propose a novel approach using Pseudo Amino Acid Composition (PseAAC) protein encoding for accurate TCR protein sequence classification. The PseAAC2Vec encoding method captures the physicochemical properties of amino acids and their local sequence information, enabling the representation of protein sequences as fixed-length feature vectors. By incorporating physicochemical properties such as hydrophobicity, polarity, charge, molecular weight, and solvent accessibility, PseAAC2Vec provides a comprehensive and informative characterization of TCR protein sequences. To evaluate the effectiveness of the proposed PseAAC2Vec encoding approach, we assembled a large dataset of TCR protein sequences with annotated classes. We applied the PseAAC2Vec encoding scheme to each sequence and generated feature vectors based on a specified window size. Subsequently, we employed state-of-the-art machine learning algorithms, such as support vector machines (SVM) and random forests (RF), to classify the TCR protein sequences. Experimental results on the benchmark dataset demonstrated the superior performance of the PseAAC2Vec-based approach compared to existing methods. The PseAAC2Vec encoding effectively captures the discriminative patterns in TCR protein sequences, leading to improved classification accuracy and robustness. Furthermore, the encoding scheme showed promising results across different window sizes, indicating its adaptability to varying sequence contexts.
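For readers unfamiliar with pseudo amino acid composition, the sketch below shows how a sequence of any length becomes a fixed-length vector. It is a simplified version of the classic type-1 PseAAC formulation using a single property (Kyte-Doolittle hydrophobicity); it is not the PseAAC2Vec method from the abstract, and the parameters `lam` and `w` and the example sequence are assumptions.

```python
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}
_MEAN = statistics.mean(KD.values())
_SD = statistics.pstdev(KD.values())
H = {aa: (v - _MEAN) / _SD for aa, v in KD.items()}  # standardized property

def pseaac(seq, lam=3, w=0.05):
    """Return a (20 + lam)-element vector: composition plus sequence-order terms."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # Average squared property difference between residues k positions apart.
        diffs = [(H[a] - H[b]) ** 2 for a, b in zip(seq, seq[k:])]
        thetas.append(sum(diffs) / (len(seq) - k))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

vector = pseaac("MKVLAAGIVALLLAAGCSS", lam=3)
print(len(vector), round(sum(vector), 6))
```

The first 20 entries carry plain composition and the last `lam` entries carry local sequence-order information, so the vector can be fed directly to an SVM or random forest.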


Subjects
Computational Biology , Proteins , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence , Amino Acids/chemistry , Amino Acids/metabolism , Algorithms , Support Vector Machine , Sequence Analysis, Protein/methods , Databases, Protein
10.
J Mol Biol ; 436(2): 168393, 2024 01 15.
Article in English | MEDLINE | ID: mdl-38065275

ABSTRACT

Many proteins contain cleavable signal or transit peptides that direct them to their final subcellular locations. Such peptides are usually predicted from sequence alone using methods such as TargetP 2.0 and SignalP 6.0. While these methods are usually very accurate, we show here that an analysis of a protein's AlphaFold2-predicted structure can often be used to identify false positive predictions. We start by showing that when given a protein's full-length sequence, AlphaFold2 builds experimentally annotated signal and transit peptides in orientations that point away from the main body of the protein. This indicates that AlphaFold2 correctly identifies that a signal is not destined to be part of the mature protein's structure and suggests, as a corollary, that predicted signals that AlphaFold2 folds with high confidence into the main body of the protein are likely to be false positives. To explore this idea, we analyzed predicted signal peptides in 48 proteomes made available in DeepMind's AlphaFold2 database (https://alphafold.ebi.ac.uk). Applying TargetP 2.0 and SignalP 6.0 to the 561,562 proteins in the database results in 95,236 being predicted to contain a cleavable signal or transit peptide. In 95.1% of these cases, the AlphaFold2 structure of the full-length protein is fully consistent with the prediction of TargetP 2.0 or SignalP 6.0. In the remaining 4.9% of cases where the AlphaFold2 structure does not appear consistent with the prediction, the signal is often only predicted with low confidence. The potential false positives identified here may be useful for training even more accurate signal prediction methods.


Subjects
Protein Sorting Signals , Sequence Analysis, Protein , Algorithms , Amino Acid Sequence , Proteome/metabolism , Sequence Analysis, Protein/methods
11.
Biochim Biophys Acta Proteins Proteom ; 1872(2): 140985, 2024 02 01.
Article in English | MEDLINE | ID: mdl-38122964

ABSTRACT

MOTIVATION: The number of unannotated proteins in UniProt grows at a very high rate every year due to more efficient sequencing methods. However, the experimental annotation of proteins is a lengthy and expensive process. Using computational techniques to narrow the search can speed up the process by providing highly specific Gene Ontology (GO) terms. METHODOLOGY: We propose an ensemble approach that combines three generic base predictors that predict Gene Ontology (BP, CC and MF) terms from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify the protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families that is then used as an indicator of confidence when assigning the GO term to an uncharacterised protein. METHODS: In the ensemble, we use a statistics-based method that scores the occurrence of GO terms in a CATH FunFam against a background set of proteins annotated by the same GO term. We also developed a set-based method that uses Set Intersection and Set Union to score the occurrence of GO terms within the same CATH FunFam. Finally, we also use FunFams-Plus, a predictor method developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge. EVALUATION: We evaluated the methods against the CAFA3 benchmark and DomFun. We used the Precision, Recall and Fmax metrics and the benchmark datasets that are used in CAFA3 to evaluate our models and compare them to the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods in the different ontologies and benchmarks. CONTRIBUTIONS: FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. We show that non-IEA models obtain higher Fmax scores than the IEA counterparts, while the models including IEA annotations have higher coverage at the expense of a lower Fmax score.
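The Fmax metric named in this abstract is the maximum, over score thresholds, of the harmonic mean of averaged precision and recall. The sketch below follows the protein-centric definition commonly used in CAFA evaluations (precision averaged over proteins with at least one prediction at the threshold, recall averaged over all benchmark proteins); it is not the authors' evaluation code, and the GO identifiers and scores are made-up toy data.

```python
def fmax(predictions, truth, steps=100):
    """Protein-centric Fmax.

    predictions: {protein: {go_term: score in [0, 1]}}
    truth:       {protein: non-empty set of true go_terms}
    """
    best = 0.0
    for s in range(1, steps + 1):
        t = s / steps
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {g for g, score in scored.items() if score >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best

# Hypothetical toy benchmark with one protein and two true terms.
truth = {"P1": {"GO:0000001", "GO:0000002"}}
predictions = {"P1": {"GO:0000001": 0.95, "GO:0000003": 0.85}}
print(round(fmax(predictions, truth), 4))
```

Raising the threshold drops the false positive and improves precision while recall stays at one half, which is why the best F-score here occurs at a high threshold; this is the same coverage-versus-precision trade-off the abstract reports between IEA and non-IEA models.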


Subjects
Proteins , Sequence Analysis, Protein , Databases, Protein , Proteins/metabolism , Molecular Sequence Annotation , Sequence Analysis, Protein/methods , Gene Ontology
13.
Nat Commun ; 14(1): 7974, 2023 Dec 02.
Article in English | MEDLINE | ID: mdl-38042873

ABSTRACT

De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides us with a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we present PepNet, a fully convolutional neural network for high accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input, and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g. PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, thus being more suitable for the analysis of large-scale proteomics data.
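Representing an MS/MS spectrum "as a high-dimensional vector" usually means binning peaks along the m/z axis. The sketch below shows one simple way to do that; the bin width, m/z range, and normalization are assumptions chosen for illustration and are not taken from PepNet.

```python
def vectorize_spectrum(peaks, max_mz=2000.0, bin_width=0.1):
    """Turn (m/z, intensity) pairs into a fixed-length vector.

    Each bin keeps the strongest peak that falls into it, and the vector is
    scaled so the base peak equals 1.0. Peaks outside [0, max_mz) are dropped.
    """
    n_bins = int(round(max_mz / bin_width))
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] = max(vec[idx], intensity)
    top = max(vec)
    return [v / top for v in vec] if top > 0 else vec

# Toy spectrum (made-up peaks); 1 Da bins keep the example easy to inspect.
peaks = [(100.2, 50.0), (100.7, 20.0), (500.5, 100.0), (2500.0, 10.0)]
vec = vectorize_spectrum(peaks, max_mz=2000.0, bin_width=1.0)
print(len(vec), vec[100], vec[500])
```

A fixed-length input like this is what allows a fully convolutional network to process spectra of any peak count in batches.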


Subjects
Sequence Analysis, Protein , Tandem Mass Spectrometry , Humans , Tandem Mass Spectrometry/methods , Sequence Analysis, Protein/methods , Peptides , Amino Acid Sequence , Neural Networks, Computer , Algorithms , Peptide Library
14.
Molecules ; 28(20)2023 Oct 16.
Article in English | MEDLINE | ID: mdl-37894596

ABSTRACT

Peptides released on frogs' skin in a stress situation represent their only weapon against micro-organisms and predators. Every species, and even every population, of frog possesses its own peptidome suited to its habitat. Skin peptides are considered potential pharmaceuticals, while the whole peptidome may be treated as a taxonomic characteristic of each particular population. Continuing the studies on frog peptides, here we report the peptidome composition of the Central Slovenian agile frog Rana dalmatina population. The detection and top-down de novo sequencing of the corresponding peptides was conducted exclusively by tandem mass spectrometry without using any chemical derivatization procedures. Collision-induced dissociation (CID), higher energy collision-induced dissociation (HCD), electron transfer dissociation (ETD) and the combined MS3 method EThcD with a stepwise increase of HCD energy were used for that purpose. MS/MS revealed the whole sequence of the detected peptides, including differentiation between isomeric Leu/Ile and the sequence portion hidden in the disulfide cycle. The array of the discovered peptide families (brevinins 1 and 2, melittin-related peptides (MRPs), temporins and bradykinin-related peptides (BRPs)) is quite similar to that of R. temporaria. Since the genome of this frog remains unknown, the obtained results were compared with the recently published transcriptome of R. dalmatina.


Subjects
Ranidae , Tandem Mass Spectrometry , Humans , Animals , Tandem Mass Spectrometry/methods , Amino Acid Sequence , Anura , Sequence Analysis, Protein/methods , Skin/chemistry
15.
Proteomics ; 23(23-24): e2200494, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37863817

ABSTRACT

Membrane proteins play a crucial role in various cellular processes and are essential components of cell membranes. Computational methods have emerged as a powerful tool for studying membrane proteins due to their complex structures and properties that make them difficult to analyze experimentally. Traditional features for protein sequence analysis based on amino acid types, composition, and pair composition have limitations in capturing higher-order sequence patterns. Recently, multiple sequence alignment (MSA) and pre-trained language models (PLMs) have been used to generate features from protein sequences. However, the significant computational resources required for MSA-based feature generation can be a major bottleneck for many applications. Several methods and tools have been developed to accelerate the generation of MSAs and reduce their computational cost, including heuristics and approximate algorithms. Additionally, the use of PLMs such as BERT has shown great potential in generating informative embeddings for protein sequence analysis. In this review, we provide an overview of traditional and more recent methods for generating features from protein sequences, with a particular focus on MSAs and PLMs. We highlight the advantages and limitations of these approaches and discuss the methods and tools developed to address the computational challenges associated with feature generation. Overall, the advancements in computational methods and tools provide a promising avenue for gaining deeper insights into the function and properties of membrane proteins, which can have significant implications in drug discovery and personalized medicine.
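The "traditional features" this review refers to, amino acid composition and pair (dipeptide) composition, are simple to compute, which also makes their limitation visible: they only see residues zero or one position apart. A minimal sketch follows; the function names and example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def aa_composition(seq):
    """20 features: fraction of each standard amino acid in the sequence."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def dipeptide_composition(seq):
    """400 features: fraction of each adjacent residue pair in the sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in DIPEPTIDES]

seq = "MKVLAAGIV"
aac = aa_composition(seq)
dpc = dipeptide_composition(seq)
print(len(aac), len(dpc), round(aac[AMINO_ACIDS.index("A")], 4))
```

Extending this scheme to triplets already needs 8,000 features, which is one practical reason MSA-derived profiles and language-model embeddings are used for higher-order patterns.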


Subjects
Algorithms , Membrane Proteins , Animals , Horses , Sequence Alignment , Amino Acid Sequence , Sequence Analysis, Protein , Computational Biology/methods
16.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37833837

ABSTRACT

Protein remote homology detection is essential for structure prediction, function prediction, disease mechanism understanding, etc. The remote homology relationship depends on multiple protein properties, such as structural information and local sequence patterns. Previous studies have shown the challenges for predicting remote homology relationship by protein features at sequence level (e.g. position-specific score matrix). Protein motifs have been used in structure and function analysis due to their unique sequence patterns and implied structural information. Therefore, designing a usable architecture to fuse multiple protein properties based on motifs is urgently needed to improve protein remote homology detection performance. To make full use of the characteristics of motifs, we employed the language model called the protein cubic language model (PCLM). It combines multiple properties by constructing a motif-based neural network. Based on the PCLM, we proposed a predictor called PreHom-PCLM by extracting and fusing multiple motif features for protein remote homology detection. PreHom-PCLM outperforms the other state-of-the-art methods on the test set and independent test set. Experimental results further prove the effectiveness of multiple features fused by PreHom-PCLM for remote homology detection. Furthermore, the protein features derived from the PreHom-PCLM show strong discriminative power for proteins from different structural classes in the high-dimensional space. Availability and Implementation: http://bliulab.net/PreHom-PCLM.


Subjects
Algorithms , Proteins , Proteins/chemistry , Neural Networks, Computer , Amino Acid Motifs , Language , Sequence Analysis, Protein/methods
17.
Mol Inform ; 42(11): e202300104, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37672879

ABSTRACT

Cell-Penetrating Peptides (CPP) are emerging as an alternative to small-molecule drugs to expand the range of biomolecules that can be targeted for therapeutic purposes. Due to the importance of identifying and designing new CPP, a great variety of predictors have been developed to achieve these goals. To establish a ranking for these predictors, a couple of recent studies compared their performances on specific datasets, yet their conclusions cannot determine if the ranking obtained is due to the model, the set of descriptors or the datasets used to test the predictors. We present a systematic study of the influence of peptide sequence similarity within the datasets on the predictors' performance. The analysis reveals that the datasets used for training have a stronger influence on the predictors' performance than the model or descriptors employed. We show that datasets with low sequence similarity between the positive and negative examples can be easily separated, and the tested classifiers showed good performance on them. On the other hand, a dataset with high sequence similarity between CPP and non-CPP will be a hard dataset, and it should be the one to be used for assessing the performance of new predictors.


Subjects
Cell-Penetrating Peptides , Cell-Penetrating Peptides/chemistry , Computational Biology/methods , Sequence Analysis, Protein
18.
Anal Chem ; 95(28): 10610-10617, 2023 07 18.
Article in English | MEDLINE | ID: mdl-37424072

ABSTRACT

Alternative splicing allows a small number of human genes to encode large amounts of proteoforms that play essential roles in normal and disease physiology. Some low-abundance proteoforms may remain undiscovered due to limited detection and analysis capabilities. Peptides coencoded by novel exons and annotated exons separated by introns are called novel junction peptides, which are the key to identifying novel proteoforms. Traditional de novo sequencing does not take into account the specificity in the composition of the novel junction peptide and is therefore not as accurate. We first developed a novel de novo sequencing algorithm, CNovo, which outperformed the mainstream PEAKS and Novor in all six test sets. We then built on CNovo to develop a semi-de novo sequencing algorithm, SpliceNovo, specifically for identifying novel junction peptides. SpliceNovo identifies junction peptides with much higher accuracy than CNovo, CJunction, PEAKS, and Novor. Of course, it is also possible to replace the built-in CNovo in SpliceNovo with other more accurate de novo sequencing algorithms to further improve its performance. We also successfully identified and validated two novel proteoforms of the human EIF4G1 and ELAVL1 genes by SpliceNovo. Our results significantly improve the ability to discover novel proteoforms through de novo sequencing.


Subjects
Algorithms , Peptides , Humans , Peptides/genetics , Peptides/chemistry , Sequence Analysis , Exons , Introns , Sequence Analysis, Protein/methods
19.
J Mol Biol ; 435(18): 168209, 2023 09 15.
Article in English | MEDLINE | ID: mdl-37479080

ABSTRACT

Characterizing the effects of mutations on stability is critical for understanding the function and evolution of proteins and improving their biophysical properties. High throughput folding and abundance assays have been successfully used to characterize missense mutations associated with reduced stability. However, screening for increased thermodynamic stability is more challenging since such mutations are rarer and their impact on assay readout is more subtle. Here, a multiplex assay for high throughput screening of protein folding was developed by combining deep mutational scanning, fluorescence-activated cell sorting, and deep sequencing. By analyzing a library of 2000 variants of Adenylate kinase we demonstrate that the readout of the method correlates with stability and that mutants with up to 13 °C increase in thermal melting temperature could be identified with low false positive rate. The discovery of many stabilizing mutations also enabled the analysis of general substitution patterns associated with increased stability in Adenylate kinase. This high throughput method to identify stabilizing mutations can be combined with functional screens to identify mutations that improve both stability and activity.


Subjects
Amino Acid Sequence , Mutation, Missense , Protein Folding , Protein Stability , Sequence Analysis, Protein , Adenylate Kinase/chemistry , Adenylate Kinase/genetics , Amino Acid Sequence/genetics , High-Throughput Screening Assays/methods , Sequence Analysis, Protein/methods , Temperature
20.
Curr Protein Pept Sci ; 24(6): 477-487, 2023.
Article in English | MEDLINE | ID: mdl-37287293

ABSTRACT

Most of the currently available knowledge about protein structure and function has been obtained from laboratory experiments. As a complement to this classical knowledge discovery activity, bioinformatics-assisted sequence analysis, which relies primarily on biological data manipulation, is becoming an indispensable option for the modern discovery of new knowledge, especially when large amounts of protein-encoding sequences can be easily identified from the annotation of high-throughput genomic data. Here, we review the advances in bioinformatics-assisted protein sequence analysis to highlight how bioinformatics analysis will aid in understanding protein structure and function. We first discuss the analyses with individual protein sequences as input, from which some basic parameters of proteins (e.g., amino acid composition, MW and PTM) can be predicted. In addition to these basic parameters that can be directly predicted by analyzing a protein sequence alone, many predictions are based on principles drawn from knowledge of many well-studied proteins, with multiple sequence comparisons as input. Identification of conserved sites by comparing multiple homologous sequences, prediction of the folding, structure or function of uncharacterized proteins, construction of phylogenies of related sequences, analysis of the contribution of conserved related sites to protein function by SCA or DCA, elucidation of the significance of codon usage, and extraction of functional units from protein sequences and coding spaces belong to this category. We then discuss the revolutionary invention of the "QTY code" that can be applied to convert membrane proteins into water-soluble proteins at the cost of marginal structural and functional changes. As in other scientific fields, machine learning has profoundly impacted protein sequence analysis. In summary, we have highlighted the relevance of the bioinformatics-assisted analysis for protein research as a valuable guide for laboratory experiments.


Subjects
Proteins , Sequence Analysis, Protein , Proteins/chemistry , Amino Acid Sequence , Amino Acids , Computational Biology